Introduction
What is Cancer?
“Cancer is the name given to a collection of related diseases. In all types of cancer, some of the body’s cells begin to divide without stopping and spread into surrounding tissues.” (National Cancer Institute)
Cancer can start anywhere in the human body, which is made up of trillions of cells. Normally, human cells grow and divide to form new cells as the body needs them. When cells grow old or become damaged, they die, and new cells take their place. When cancer develops, however, this orderly process breaks down.
How does cancer form? (National Cancer Institute)
I. Overview and Motivation
Nowadays, the number of cancer cases is increasing rapidly worldwide. According to the American Cancer Society, lung cancer is the second most common cancer in both men and women. As we know, the probability of developing this cancer is higher for smokers.
For this reason, we would like to analyze the relationship between lung cancer and the intensity of cigarette smoking to see how strong their correlation is. Tobacco smoking has been one of the world’s largest health problems for many decades. Over the entire 20th century, it is estimated that around 100 million people died prematurely because of smoking, most of them in rich countries. According to the Global Burden of Disease study, more than 8 million people died prematurely as a result of smoking in 2017.
At the same time, air pollution is also a leading risk factor for death worldwide. According to the World Health Organization (WHO), air pollution is increasing in poor countries whereas it is decreasing in rich countries. In developing countries, pollution causes on average seven million deaths per year. The goal of our project is to analyze the relationship between cancer incidence (overall and lung cancer), GDP per capita, air pollution and the intensity of smoking, meaning the number of cigarettes smoked per day. We strongly believe that these variables are correlated in some way. In our view, health issues, and cancers in particular, deserve close attention as they cause a large number of deaths worldwide.
III. Research questions
Is the risk of developing lung cancer positively correlated with the number of cigarettes smoked per day?
Is there a relationship between the gdp per capita of a country and the risk of developing cancer?
Does the number of cigarettes smoked per day increase when the gdp per capita increases over time?
Are people exposed to higher air pollution more likely to develop cancers?
Can we see that in areas where the air is the most polluted the incidence of risk of having lung cancers is also the highest?
Data
Sources
- Import the data from the CSV files into data frames and check the import (the number of imported rows and columns should match the number of rows and columns in the original file)
library(readr)  # provides read_csv()
cancer <- read_csv(file = here::here("data/cancer-deaths.csv"))
Air <- read_csv(file = here::here("data/Airglobal.csv"))
smokers <- read_csv(file = here::here("data/smokglobal.csv"))
Lungcancer <- read_csv(file = here::here("data/Lungcancer.csv"))
gdp <- read_csv(file = here::here("data/gdpglobal.csv"))
cancerValue <- read_csv(file = here::here("data/cancervalue.csv"))
continent <- read_csv(file = here::here("data/continent.csv"))
- Description
To answer our research questions we have chosen 7 different datasets coming from various sources.
The two datasets¹ about cancer (overall and lung cancer) both have 185 observations and 3 variables. For each country, the cancer table reports the estimated cumulative risk of incidence in 2018, for both sexes and ages between 0 and 74. The cumulative incidence is calculated as the number of new events or cases of disease divided by the total number of individuals in the population at risk for a specific time interval. Researchers can use cumulative incidence to predict the risk of a disease or event over short or long periods of time.
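The definition of cumulative incidence given above can be written as a one-line function. This is only a minimal sketch: the function name `cumulative_incidence` and the numbers are hypothetical and are not taken from the cancer datasets, and the conversion to a percentage is done purely for readability.

```r
# Cumulative incidence: new cases divided by the population at risk
# over a given time interval. Name and inputs are illustrative assumptions.
cumulative_incidence <- function(new_cases, population_at_risk) {
  stopifnot(population_at_risk > 0, new_cases >= 0)
  new_cases / population_at_risk
}

# Hypothetical example: 184 new cases among 1000 people at risk
ci <- cumulative_incidence(184, 1000)
ci_percent <- 100 * ci
```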
A dataset² containing the number of cancer deaths, the cancer death rate, and the age-standardized death rate for each country between 1990 and 2017 was also taken. It has 6468 observations and 6 variables. We chose this dataset for its variable representing the number of cancer deaths per country.
The fourth dataset³ is about the daily cigarette consumption per smoker for each country. The extent of smoking is determined not only by the prevalence of smoking in a population but also by its intensity, measured as the average number of cigarettes consumed daily by smokers. This dataset contains 6204 observations and 4 variables. The data was collected between 1980 and 2012. These datasets are related since many scientific works have shown that smoking increases the odds of developing diseases.
The 5th dataset⁴ is about air quality. The table contains 580 observations and 5 variables collected in 2016. It gives the concentration of fine particles in the air per country. The mean annual concentration of fine suspended particles of less than 2.5 microns in diameter (PM2.5) is a common measure of air pollution. The mean is a population-weighted average for the urban population of a country. The higher the value, the more the population of the country is exposed to fine particles. Air pollution consists of many pollutants, among them particulate matter. These particles are able to penetrate deeply into the respiratory tract and therefore constitute a health risk by increasing mortality from respiratory infections and diseases.
The 6th dataset⁵ we use to answer our research questions is about the GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. The dataset consists of 6407 observations and 4 variables; the GDP per capita is given for each country between 1990 and 2017. Although these data tables do not cover exactly the same periods of time, all the datasets are recent enough (2012, 2016, 2017, 2018) to make estimates and analyze possible correlations between the variables.
Finally, we used a dataset⁶ listing all the countries and the continents to which they belong, in order to carry out analyses at the global level and to see the differences between the continents.
- Datasets sources:
Wrangling/cleaning
In this part we clean, prepare and manipulate the data. The goal is to understand the data and to identify the target variables and the dependent variables.
We merge several tables together, since we want to see the relationship between cancers and the other variables. This will make the EDA and Analysis parts easier later.
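As an illustration of gathering tables, the sketch below joins two small data frames on a country code with base R's `merge()`. The data frames, their values and the name `cancer_summary_demo` are hypothetical; the actual join keys and join type used in the project are not stated in this report.

```r
# Two toy tables keyed by ISO country code (values are made up)
cancer_demo <- data.frame(
  Code = c("FRA", "KEN", "JPN"),
  value_all = c(30.4, 14.3, 24.0),
  stringsAsFactors = FALSE
)
smokers_demo <- data.frame(
  Code = c("FRA", "KEN", "BRA"),
  number = c(13.0, 9.5, 16.0),
  stringsAsFactors = FALSE
)

# Inner join: only countries present in both tables are kept
cancer_summary_demo <- merge(cancer_demo, smokers_demo, by = "Code")
```

An inner join silently drops countries missing from either table, so comparing `nrow()` before and after the merge is a cheap sanity check.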
Spotting mistakes, missing data, listing anomalies and outliers
We observed that there are missing values that need to be removed. In addition, we noticed outliers in some datasets. For example, in Suriname the number of cigarettes smoked per day is around 108.9, which seems very high compared to the values in other countries. That’s why we compute Q1, Q3 and the IQR to remove the potential outliers from the dataset.
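The Q1/Q3/IQR rule mentioned above can be sketched as follows. The helper `remove_outliers_iqr`, the multiplier `k = 1.5` (the usual boxplot convention, assumed here) and the toy data are illustrative; only the Suriname value of 108.9 comes from the text.

```r
# Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]; rows with NA are dropped.
remove_outliers_iqr <- function(df, column, k = 1.5) {
  x <- df[[column]]
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  keep <- !is.na(x) & x >= q1 - k * iqr & x <= q3 + k * iqr
  df[keep, , drop = FALSE]
}

# Toy data: five made-up countries plus the Suriname value quoted above
smokers_demo <- data.frame(
  Entity = c("A", "B", "C", "D", "E", "Suriname"),
  number = c(12, 15, 18, 20, 23, 108.9),
  stringsAsFactors = FALSE
)
smokers_clean <- remove_outliers_iqr(smokers_demo, "number")
```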
We finally decided to add a dataset with the continent of each country in order to have a more global view. Moreover, we noticed an error in the continent dataset: Armenia is present twice, once assigned to Europe and once to Asia. Once the error is corrected, we obtain these summary tables.
In the cancerSummary table (Table2), Value_all represents the cumulative incidence risk for all types of cancer, Value_lung the cumulative incidence risk for lung cancer, Value_air the PM2.5 concentration in the air (in other words the air quality), and the column number the number of cigarettes smoked per day in each country.
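A duplicate such as the Armenia entry mentioned above can be detected with `duplicated()`. This is a sketch on a toy table: the column names `Entity` and `Continent` are assumptions, and keeping the first occurrence is an arbitrary choice, since the report does not say which continent was retained.

```r
continent_demo <- data.frame(
  Entity = c("Armenia", "Armenia", "France", "Kenya"),
  Continent = c("Europe", "Asia", "Europe", "Africa"),
  stringsAsFactors = FALSE
)

# Countries that appear more than once
dup_names <- unique(continent_demo$Entity[duplicated(continent_demo$Entity)])

# Keep one row per country (here: the first occurrence, an arbitrary choice)
continent_fixed <- continent_demo[!duplicated(continent_demo$Entity), ]
```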
Exploratory data analysis
Mapping out the underlying structure
Relationship between the datasets
This figure helps to understand the relations between the cancerSummary table and all the other datasets. Indeed, this table contains the information of all the important variables.
Identifying the most important variables
The cancerSummary table and the figure above group together all the important variables that we will use during our project.
In this Exploratory data analysis part, we want to understand the important variables and how they relate to each other. We look at each dataset and get an overall overview of the data using several visualizations.
Univariate visualizations
The goal of the univariate visualizations is to get an overview of the different datasets. For each dataset, we describe which variables are discrete and which are continuous. For each analysis we took the most recent year of the dataset, since we want the analyses to be as representative as possible of the current situation.
Moreover, log transformations were used for the histograms. They make highly skewed distributions less skewed, which can make patterns in the data easier to interpret.
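The effect of a log transformation on skewness can be checked numerically. The sketch below uses a hand-written moment-based `skewness()` helper (an assumption, base R has no built-in one) and made-up values spanning several orders of magnitude, not the project's data.

```r
# Moment-based sample skewness: 0 for a symmetric sample, > 0 for a long right tail
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Made-up counts spanning several orders of magnitude
deaths_demo <- c(10, 100, 1000, 10000, 100000)

skew_raw <- skewness(deaths_demo)       # strongly right-skewed
skew_log <- skewness(log(deaths_demo))  # evenly spaced after the log, so symmetric
```

For a histogram, the same idea amounts to calling `hist(log(x))` instead of `hist(x)`.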
Cancer dataset
First we look at the cancer dataset. The dataset has 3 discrete variables (Entity, Code and Year) and 3 continuous variables that describe the cancer deaths in 2017.
The cancer boxplot: The box is barely visible because the number of outliers is very large. We can see neither the tails nor the median. This means that the number of deaths varies a lot between countries. These outliers must be removed for further analysis, which takes place in the next step of the project.
The cancer histogram: We used a log transformation to reduce the skewness of the original data. This makes the distribution closer to normal so that the histogram is easier to interpret. It still has a fairly long tail on the left side, but in general it represents the data well.
Cancer Value dataset
We now look at the dataset named cancerValue. In the cancerValue dataset, the discrete variables are Population and ISO Code, and the continuous variable is Value (representing the cumulative incidence risk in 2018).
The cancerValue boxplot: The lower tail is short whereas the upper tail is quite long. There are no outliers and the median is just above 15. Half of the observations have a cumulative incidence of cancer between 12.1 and 24.5. These numbers are higher than for lung cancer, which makes sense because here all cancers are combined. On average, the cumulative incidence of cancer worldwide is 18.4.
The cancerValue histogram: In the histogram, we can see that the average cumulative incidence of cancer is just under 20. This is represented by the red line. Values near 20 (above and below) are also frequent in the dataset.
Lung cancer dataset
Let’s now look at the Lungcancer dataset. The discrete variables are ISO code and Population, and the continuous variable is Value.
The Lungcancer boxplot: We can see that the lower tail is short whilst the upper tail is long. The median is 1.5 and is represented by the black line. There is one outlier, shown as a circle at the top of the boxplot. Half of the countries have a lung cancer incidence between 0.56 and 3.1, and the typical (median) incidence is 1.5.
The Lungcancer histogram: The histogram shows that, on average, the risk of developing lung cancer is around 2. There are still many countries where the incidence is higher, up to approximately 8.
Cigarettes dataset
The discrete variables in this dataset are Entity, Code and Year, and the continuous variable is the number of cigarettes smoked per day. For the continuous variable, we can now look at the central tendency and the spread of the data.
The smokers boxplot: The median is 17.5 and is represented by the black line in the boxplot. The upper and lower tails are quite short. There are three outliers (circles), which are uncommon values. The interquartile range box represents 50% of the data, which means that in half of the observations smokers consume between 11.8 and 23.5 cigarettes per day.
The smokers histogram plot: We can see in the histogram that most of the values are concentrated between 10 and 40, as the bins corresponding to these values are the highest. The histogram is fairly symmetric, as the left and right tails have similar sizes. This means that low and high daily consumption levels occur in similar proportions.
Gdp per capita dataset
Let’s look at the gdp dataset. In the gdp dataset, the discrete variables are Entity, Code and Year, and the continuous variable is the value of the GDP per capita.
The GDP per capita boxplot: The tails are really short (especially the lower one) and there is a large number of outliers. Half of the countries in this dataset have a GDP per capita between 3021 and 19 608. A few countries have a much higher GDP per capita (up to 135 319), and some of these countries are considered “outliers”. The black line in the boxplot is the median, meaning that half of the countries have a GDP per capita below 14 926 and half above.
The GDP per capita histogram: The histogram is quite symmetric, like the previous one. The red line represents the average GDP per capita in the dataset, which is 18 958.
Air dataset
Finally, we look at the Air dataset. All the variables are discrete except the last one, which represents the pollution level. In order to describe the boxplot, we use the air1 dataset, in which the air value was converted into a number.
The Air boxplot: The boxplot is quite symmetric, with a longer upper tail than lower tail. There are however some outliers at the top of the boxplot. Half of the observations in this dataset have a concentration of fine particles between 15 and 30. Some countries have a much higher concentration of particles, and some of them are classified as “outliers” in our analysis, as we can see in the boxplot.
The Air histogram: This is a fairly symmetric histogram. We can say that, in most countries, the air pollution has a value between approximately 10 and 40. A few countries have lower or higher pollution than this.
Multivariate visualizations
In this part we will analyze and explore the behavior of the variables between them.
Cancer is one of the leading causes of death. That’s why we first look at the evolution of the number of people dying of cancer over time in each continent.
From these graphs it can be seen that the average number of cancer deaths is increasing significantly over time in each continent. The continents with the highest number of cancer deaths are Asia and Oceania, whereas Africa has the fewest cancer deaths.
Now that we have seen a significant increase in the number of cancer deaths around the world, we are going to take an in-depth look at the risk of developing cancer, especially lung cancer, around the world.
- Analysis of the risk of developing all types of cancer and lung cancer worldwide
These boxplots show that the risk of developing cancer is highest in Oceania and Europe and lowest in Africa. Similarly, the risk of developing lung cancer is highest in Europe, Oceania and Asia and much lower in Africa.
Now that we know from a global perspective the continents where the risk of developing cancer is the highest, we can see at the country level where the risk is the highest with the help of barplots.
As seen above, the countries with the highest risk of developing any type of cancer and lung cancer are in Europe and Oceania. We notice that Belgium and Hungary are among the countries where there is the highest risk of developing all types of cancer and lung cancer.
The pie chart below shows that the risk of developing lung cancer is about 1/8 of the risk of developing any type of cancer.
We have now seen from various graphs that the number of deaths and the risk of developing cancer, especially lung cancer, are relatively high worldwide. We will take a closer look at the variables that could influence the risk of developing cancer, such as the number of cigarettes smoked and the air pollution.
- Analysis of the number of cigarettes smoked per day worldwide
First of all we will look at the global level, i.e. on a continental scale, then we will see more precisely the countries where the most cigarettes are smoked per day.
With these boxplots we can compare the number of cigarettes smoked per day in each continent. Most of the boxes are at approximately the same height, but the medians show that the number of cigarettes smoked per day is highest in Europe and South America (medians of 20 and 19.7). Moreover, in Europe the values are much less dispersed than in the other continents. We can establish a link between these boxplots and those seen above representing the risk of developing cancer and lung cancer, since the risk was highest in Europe (the continent where the most cigarettes are smoked) and lowest in Africa (the continent where the fewest cigarettes are smoked per day).
Now we will see the top 6 countries with the highest number of cigarettes smoked per day.
Surprisingly, 2 out of 6 countries are part of the African continent. This can be explained by the fact that the dispersion of the number of cigarettes smoked is very large in Africa, as can be seen in the boxplot above. The median in Africa nevertheless remains around 12 cigarettes, while it is 20 in Europe.
We are now going to study the air pollution variable. More polluted air could increase the risk of developing cancer, especially lung cancer, since inhaling polluted air can interfere with the growth and function of the lungs.
- Analysis of the quality of the air worldwide
At the country level, the barplot below shows that the 6 countries where the air is the most polluted belong to the Asian and African continents.
Now we are going to analyse the GDP per capita at the continental level.
- Analysis of the GDP per Capita worldwide
From these boxplots it can be seen very clearly that Europe is the continent with the highest GDP per capita (median of about 29 800). Asia has the 2nd highest GDP per capita; however, it also has 3 outliers with an extremely high GDP per capita (for example 116 935 or 85 535) compared to its median of 12 600.
Africa is the continent with the lowest GDP per capita: its median is 2 660 and its upper fence is about 16 562, which is still more than 1.5 times lower than the European median.
The barplot of the countries with the highest GDP per capita illustrates the boxplots seen above: the 6 countries belong to the European and Asian continents.
The GDP per capita varies a lot from one country to another, so it may be interesting to look at its evolution over time by country and not by continent.
In the graphs below we have the evolution over time of the gdp per capita for some countries of each continent.
In these graphs we see that for the majority of countries the gdp per capita increases a lot over time. We also notice large differences between countries (which is why the scales differ from one country to another).
Correlation matrix
The correlation matrix helps to understand the relationships between the important variables. It is also useful for summarizing the data, as an input for the analysis part.
This matrix shows the correlation coefficients between variables.
We see directly that value_air, i.e. air pollution, has a zero or negative correlation with all the other variables. The remaining variables are positively correlated with each other (for some pairs the correlation is low but still positive). We study the different relationships between the variables in detail in the “Analysis” section.
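A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses a small made-up table with column names borrowed from the cancerSummary table; the values are invented only to show the mechanics and are not the project's results.

```r
# Toy table: made-up values, one row per hypothetical country
summary_demo <- data.frame(
  value_air  = c(60, 45, 30, 15, 10),
  GDP        = c(2000, 5000, 15000, 30000, 45000),
  value_lung = c(0.3, 0.8, 1.9, 3.0, 3.4)
)

# Pearson correlations between every pair of columns, ignoring incomplete rows
cor_matrix <- cor(summary_demo, method = "pearson", use = "complete.obs")
round(cor_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.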
Summary table
- Summary statistics for each important variable from the CancerSummary dataset:
In this part of the project, we calculate the statistics (minimum, first quartile, median, mean, third quartile and maximum) of our most important variables and summarize everything in the table below.
| Stat | value_air | GDP | value_all | value_lung | number |
|---|---|---|---|---|---|
| Min. | 5.88 | 702 | 7.4 | 0.020 | 1.0 |
| 1st Qu. | 14.05 | 5106 | 12.0 | 0.662 | 11.9 |
| Median | 19.45 | 13641 | 15.9 | 1.600 | 17.7 |
| Mean | 25.48 | 19130 | 18.6 | 1.962 | 17.5 |
| 3rd Qu. | 32.55 | 27114 | 24.6 | 3.175 | 23.2 |
| Max. | 94.33 | 116936 | 41.9 | 7.000 | 40.3 |
This table shows a large dispersion in the observations of each variable, which indicates that the data vary a lot from one country to another. For instance, on average people smoke 17.5 cigarettes a day, and the average GDP per capita is 19 130, ranging from 702 to 116 936.
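A table with these six statistics per variable can be produced by applying `summary()` to each column. This is a sketch on made-up values; the column names mirror the table above but the numbers are placeholders, not the project's data.

```r
# Placeholder observations (not the real data)
summary_demo <- data.frame(
  number = c(5, 10, 15, 20, 30),
  GDP    = c(1000, 5000, 10000, 20000, 50000)
)

# One column per variable; rows are Min., 1st Qu., Median, Mean, 3rd Qu., Max.
stats_table <- sapply(summary_demo, summary)
stats_table
```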
We will analyze more precisely the relationships between the different variables in the next part.
Analysis
Answers to the research questions
2. Is there a relationship between the gdp per capita of a country and the risk of developing cancer?
In the previous section EDA it was noticeable that there were outliers in the gdp per capita dataset. Once we remove the outliers, the relationship between the gdp per capita and the risk of developing cancer in each country is analyzed.
The linear regression shows a positive relationship between the gdp per capita of each country and the risk of developing cancer.
The intercept and the slope are statistically different from 0. It means that, when the gdp per capita increases by 1000, the average risk of developing cancer increases by 0.422.
How well does this model fit the data?
- Summary of the residuals: The distances between the observations and the model
The residuals look fairly symmetrical around 0, which is consistent with the model fitting the data well.
The 95% confidence interval is narrow, which suggests that the model fits our data well and that there is little variability within the observations.
The R² indicates that approximately 59% of the variation in the risk of developing cancer is explained by the gdp per capita in this model.
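The quantities discussed above (slope, residual summary, confidence interval and R²) can all be read off a fitted `lm` object. The sketch below uses synthetic data: the variable names, the true slope of 0.4 and the noise level are assumptions chosen for illustration and do not reproduce the report's estimates.

```r
set.seed(42)

# Synthetic data: GDP per capita in thousands and a cancer risk that rises with it
gdp_k <- seq(1, 40, length.out = 60)
risk  <- 10 + 0.4 * gdp_k + rnorm(60, sd = 3)

fit <- lm(risk ~ gdp_k)

slope         <- coef(fit)[["gdp_k"]]                 # change in risk per 1000 of GDP
resid_summary <- summary(residuals(fit))              # should be centred on 0
ci_slope      <- confint(fit, "gdp_k", level = 0.95)  # 95% interval for the slope
r_squared     <- summary(fit)$r.squared               # share of variance explained
```

Expressing GDP in thousands makes the slope directly readable as "change per 1000", which matches how the coefficients are interpreted in the text.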
To conclude, the risk of developing cancer increases with the country’s gdp per capita.
How to explain these results?
Several reasons can explain these findings. As we have seen above, the higher the number of cigarettes smoked, the higher the risk of developing cancer. Moreover, we can assume that the higher the gdp per capita, the more people can afford to buy cigarettes. This can be one reason why the risk of developing cancer is positively related to the gdp per capita (this hypothesis is tested in question 3).
In addition, access to health care is significantly lower in countries where the gdp per capita is low, according to the World Health Organization (WHO¹). Indeed, in some countries data are less reliable and health services are not equipped to prevent, diagnose and treat cancer. Cancers are therefore likely to be under-diagnosed and under-reported there, which would also explain why the measured incidence of cancer is higher in countries where the gdp per capita is high.
3. Does the number of cigarettes smoked per day increase when the gdp per capita increases over time (between 1990 and 2012)?
To answer this question we first analyzed the number of cigarettes smoked per day according to the evolution of the gdp per capita over time by country. The analysis was conducted for Europe, one of the continents where the risk of developing cancer is the highest. Then, the study analyses the case of Africa, where the risk of developing cancer is the lowest according to our data.
In the graphs below, the blue lines represent the evolution of the gdp per capita and the barplots show the evolution of the number of cigarettes smoked per day between 1990 and 2012.
- Europe
For some countries in Europe, such as France and Germany, the consumption of cigarettes and the gdp per capita evolve in opposite directions. However, in other countries, such as Denmark and Estonia, the number of cigarettes smoked per day increases as the gdp per capita increases.
- Africa
The conclusion remains the same for the second graph. It is indeed difficult to draw a general conclusion for all countries. Some countries experience an increase in the consumption of cigarettes when the GDP per capita increases, while other countries show the exact opposite. What we can say is that, overall, African countries have a gdp per capita that remains relatively constant (or increases slightly) over the years, unlike the European countries.
From these graphs, it is difficult to get an overall picture, as each country follows a different pattern.
Finally, using the information provided by the graphs above, we will analyze whether worldwide there is a positive relationship or not between the number of cigarettes smoked and the gdp per capita using a linear regression.
Relationship between the gdp per capita and the number of cigarettes smoked
To analyze and predict the relationship between these two variables, we take the most recent observations for each variable (2017 and 2012) so that the results are as close as possible to the current situation.
This graph shows that there is a positive correlation between the gdp per capita and the number of cigarettes smoked per day.
The intercept and the slope are statistically different from 0. It means that, when the gdp per capita increases by 1000, the number of cigarettes smoked per day increases by 0.25.
How well does this model fit the data?
- Summary of the residuals:
The residuals look roughly symmetrical around 0, suggesting that our model does not fit the data perfectly.
The 95% confidence interval shows that there is variability within the observations.
The R² shows that only 16% of the variation in the number of cigarettes smoked per day is explained by this linear model.
To conclude, the higher the country’s gdp per capita, the higher the number of cigarettes consumed.
How to explain these results?
This relation can be explained by the fact that the higher the gdp per capita, the more individuals can afford to buy cigarettes.
Nevertheless, prevention policies may be one of the reasons why the number of cigarettes smoked per day has declined while the gdp per capita increased in some countries. Indeed, in France and Germany for example, more tobacco control policies are in place than in other countries where fewer resources are devoted to prevention (WHO²).
4. Are people exposed to higher air pollution more likely to develop cancers?
In order to answer this question a linear regression between the air pollution and the risk of developing cancer is calculated.
In these graphs, we can see that there is no positive correlation between air pollution and the risk of developing cancer or lung cancer. On the contrary, the linear regressions indicate that the more polluted a country’s air is, the lower the risk of developing cancer.
How to explain these results?
This result seems illogical and shows the limitations of our data. Indeed, we can see that Africa (red points) is one of the continents where the air is highly polluted, yet the risk of developing cancer remains very low compared to Europe, for example. One of the reasons is that cancer is not recognized as a high-priority health problem in most low- and middle-income countries (Institute of Medicine³). Additional observations and variables measuring air pollution and cancer would be needed to evaluate whether higher air pollution increases the risk of developing cancer.
In conclusion, based on the data, it cannot be stated that higher air pollution is related to a higher risk of developing cancer.
However, we can clearly see that the countries with higher air pollution are those with a lower gdp per capita and belong to the African continent (red points). Conversely, in Europe, the continent where the gdp per capita is the highest, the air is also the least polluted (green points).
Therefore, an analysis of the relationship between the air pollution and the gdp per capita of a country is conducted.
This graph confirms the assumption made above: the air is the most polluted in countries with a low gdp per capita. We clearly see that it is in Asia and Africa that the air is the most polluted (red and blue points), and Africa is among the continents with the lowest gdp per capita.
The intercept and the slope are statistically different from 0. When the gdp per capita increases by 1000, the concentration in PM2.5 decreases by 0.34.
How well does this model fit the data?
- Summary of the residuals:
The residuals are not quite symmetrical, so other quality indicators have to be evaluated to assess the relevance of this model.
The 95% confidence interval shows that there is variability within the observations but that the model fits the data relatively well.
The R² shows that 23% of the variation in the PM2.5 concentration is explained by the linear model.
To conclude, there is a negative relation between the gdp per capita and the air pollution. Higher air pollution in a country is associated with a lower gdp per capita.
5. Can we see that in areas where the air is the most polluted the incidence of risk of having lung cancers is also the highest?
As seen above, the areas where the air is the most polluted are the ones where the countries have a lower gdp per capita. This is the case in Asia and Africa. That’s why the relationship between the air pollution and the risk of developing lung cancer is analyzed for the countries of these continents.
The regression line indicates that there is no relationship between the air pollution and the risk of developing lung cancer in Africa and Asia. The values are scattered seemingly at random on the plot.
The intercept and the slope are not statistically different from 0 in this model. It means that changes in the risk of developing lung cancer are not associated with changes in PM2.5 concentration.
Moreover, the confidence interval is wide, which suggests that the sample is too small or too dispersed. Indeed, when the dispersion is high, the conclusion is less certain and the confidence interval becomes wider.
To conclude, the data and this linear model do not allow us to determine whether polluted air increases the risk of developing lung cancer in areas where the air is the most polluted.
This shows that a predictive model is generally more dependent on the quantity and quality of the data and the care taken in their preparation and selection, than on the modeling technique itself.
Different methods considered
Correlation matrix: Pearson correlation coefficients were used to examine the strength and direction of the linear relationship between the continuous variables. The correlation coefficient can range in value from −1 to +1. The larger the absolute value of the coefficient, the stronger the relationship between the variables. However, based on correlation calculations, it cannot be inferred that there are cause and effect relationships between the variables.
Linear regressions were used to predict the risk of developing cancer based on a variety of variables. This method is applied to a dataset to predict the response variable based on a predictor variable, or to study the relationship between a response and a predictor variable. Using regressions to make predictions doesn’t necessarily involve predicting the future. A regression predicts the mean of the dependent variable given specific values of the independent variable.
Multiple R-squared: The R² value is a measure of how close the data is to the linear regression model. The values are always between 0 and 1; numbers closer to 1 represent well-fitting models.
P-value: The p-value is associated with the F statistic and is used to interpret the significance for the whole model fit to the data.
Confidence interval: It is a range of values that is expected to contain the true value of a population parameter, such as the mean, at a chosen confidence level (for example 95%). As the sample size increases, the interval narrows, meaning that the parameter can be estimated more precisely than with a smaller sample.
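The quantities described above can be illustrated with a minimal sketch. It is not the code used in this project: the function names and the numbers are made up for illustration, and the confidence interval relies on a normal quantile so that only the Python standard library is needed. With few observations, a Student t quantile would give a wider interval. The p-value is left out because it requires the F (or t) distribution, which the standard library does not provide.

```python
import math
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simple_ols(x, y, level=0.95):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared, slope_ci). The interval is an
    approximation based on a normal quantile (an assumption of this sketch).
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual and total sums of squares give R-squared.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    r_squared = 1 - sse / sst

    # Standard error of the slope, using n - 2 residual degrees of freedom.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    slope_ci = (slope - z * se_slope, slope + z * se_slope)
    return intercept, slope, r_squared, slope_ci


# Made-up illustrative values, not data from this project.
cigarettes = [5, 10, 15, 20, 25, 30]
risk = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]

r = pearson_r(cigarettes, risk)
intercept, slope, r_squared, slope_ci = simple_ols(cigarettes, risk)
print(f"r = {r:.3f}, slope = {slope:.3f}, R2 = {r_squared:.3f}, CI = {slope_ci}")
```

In a simple regression with one predictor, R2 equals the square of the Pearson coefficient, which is why both values tell the same story here.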
Competing approaches
Other approaches can be considered to model and predict the risk of developing cancer. However, they require a larger amount of data and are more complex to implement.
Decision trees are sets of if-else rules that can be used to predict a result from data. They are a supervised machine learning method. In our project, we could use a regression tree to predict cancer incidence, which is a continuous dependent variable, from continuous factors such as the number of cigarettes smoked, the air pollution and the GDP per capita of the country. To build a regression tree, candidate split points are considered for each independent variable. At each candidate point, the data are divided into two groups and the sum of squared errors (SSE) between the predicted and actual values is computed and compared across the variables. The variable and point with the lowest SSE are chosen as the split. This process is then repeated recursively on each resulting group.
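The split selection described above can be sketched as follows. This is only an illustration of one split, under the assumption that each side of a split predicts the mean of its observations (the usual convention for regression trees). The function names and the data are hypothetical.

```python
def sse(values):
    """Sum of squared errors of `values` around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, target):
    """Return (feature, threshold, total_sse) for the single best split.

    `rows` is a list of dicts mapping feature names to numbers and `target`
    is the list of responses. Candidate thresholds are the midpoints between
    consecutive distinct values of each feature.
    """
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, target) if row[feature] <= threshold]
            right = [t for row, t in zip(rows, target) if row[feature] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (feature, threshold, total)
    return best


# Made-up illustrative observations, not data from this project.
rows = [
    {"cigarettes": 5, "pm25": 40},
    {"cigarettes": 8, "pm25": 12},
    {"cigarettes": 20, "pm25": 35},
    {"cigarettes": 25, "pm25": 10},
]
target = [1.0, 1.2, 4.0, 4.3]

feature, threshold, total = best_split(rows, target)
print(feature, threshold, total)
```

A full tree would call `best_split` again on the left and right groups until a stopping rule is met, for example a minimum number of observations per group.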
K-Nearest Neighbors (KNN) makes a prediction for a new observation by searching for the most similar training observations and pooling their values. This method relies on labeled input data to produce an appropriate output when given new unlabeled data. KNN is built on the idea of similarity. Its advantages are that it is simple and easy to implement, and that there is no need to build a model, tune several parameters, or make additional assumptions. The algorithm is also versatile: it can be used for classification, regression, and search problems. However, it gets significantly slower as the number of examples and/or predictors (independent variables) increases.
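As a rough illustration of the idea, a KNN regression can be written in a few lines. The sketch below assumes Euclidean distance and a plain average of the neighbors' values; the function name and the data are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a value by averaging the responses of the k nearest points."""
    # Pair each training response with its distance to the query point.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Made-up illustrative observations: (cigarettes per day, PM2.5) and a risk value.
train_x = [(5, 40), (8, 12), (20, 35), (25, 10), (30, 20)]
train_y = [1.0, 1.2, 4.0, 4.3, 5.0]

prediction = knn_predict(train_x, train_y, query=(22, 30), k=2)
print(prediction)
```

Because every prediction compares the query with all training points, the cost grows with the size of the dataset. In practice the variables would also need to be put on comparable scales before computing distances.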
Neural networks such as the MLP (multilayer perceptron) can learn complex patterns using layers of neurons that mathematically transform the data. The layers between the input and the output are referred to as “hidden layers”. A neural network can learn relationships between the features that other algorithms cannot easily discover. Nevertheless, it requires a very large amount of data and a long training time.
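To make the notion of layers more concrete, the sketch below computes the output of a very small MLP with one hidden layer. It only shows how the data are transformed from input to output. The weights are arbitrary values chosen for illustration, whereas a real network would learn them from data during training.

```python
def relu(values):
    """Rectified linear activation applied to each element."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the weights of neuron j."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input -> hidden layer with ReLU -> single linear output neuron."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return dense(hidden, out_w, out_b)[0]


# Arbitrary illustrative weights: 2 inputs, 2 hidden neurons, 1 output.
hidden_w = [[0.5, -0.2], [-0.3, 0.8]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, 2.0]]
out_b = [0.5]

output = mlp_forward([2.0, 1.0], hidden_w, hidden_b, out_w, out_b)
print(output)
```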
Justifications
In this project, the modeling method chosen is linear regression. It was one of the most suitable methods for our project because most of our datasets did not contain a large number of observations, and it was an efficient and appropriate way to answer our research questions. Indeed, linear regression is one of the simplest and most common supervised machine learning algorithms used for predictive modeling. This model is used to measure the influence of one quantitative variable on another quantitative variable.
The advantages of this method are that it is quick to compute, can be easily updated with new data, and is relatively easy to understand and explain. However, it does not capture complex or non-linear relationships well.
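The limitation regarding non-linear relationships can be seen on a small artificial example. In the sketch below, the response depends perfectly on the predictor through a quadratic curve, yet a straight line fitted by least squares explains none of the variance. The data and the function name are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares line through (x, y); returns (intercept, slope, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1 - sse / sst


# The response is an exact function of x, but the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

line_intercept, line_slope, line_r2 = fit_line(x, y)
print(line_intercept, line_slope, line_r2)
```

Here the fitted slope and the R2 are both zero even though y is fully determined by x, so a low R2 in a linear model does not by itself rule out a relationship between two variables.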
Conclusion
Take home message
The number of cancers is increasing rapidly worldwide. Based on the analysis conducted in this project, the incidence of cancer varies depending on numerous variables. An interesting aspect is that lung cancer represents only 1/8 of all cancers. This has to be taken into account because, even though lung cancer is the second most common cancer worldwide, many other types of cancer can affect the lives of both men and women.
The analysis of the intensity of smoking and the risk of developing lung cancer showed that there is a relationship between the number of cigarettes smoked and the risk of developing cancer and lung cancer. However, smoking is not the only factor related to an increased risk of developing cancer and lung cancer. Indeed, the correlation between the GDP per capita and cancer incidence is stronger than the one between the number of cigarettes smoked and the risk of developing cancer. One reason may be that people in wealthy countries can afford to purchase more cigarettes, which affects their health. Another reason may be that people in those countries have better access to health care services, so more people are diagnosed. Moreover, diet is also an important aspect to take into account because in wealthy countries such as the United States, people often make poor diet choices, which eventually lead to health issues.
The idea of this project was also to investigate the link between air pollution and cancer to see whether the quality of the air breathed can be related to health. We found that the air is more polluted in Asia and Africa, but that these continents have the lowest risk of developing cancer. Consequently, we found no evident correlation between air pollution and cancer, which is consistent with one of the related works mentioned in the introduction of this project. Other air quality measures could therefore be considered to study the relationship between these two variables. Overall, the risk of developing cancer is strongly correlated with the GDP per capita and moderately correlated with the number of cigarettes smoked per day, but shows no clear correlation with air pollution.
Limitations
This study has enabled us to analyze the relationships between cancer and other factors such as the number of cigarettes smoked per day, the GDP per capita and the air pollution. Even though we have identified some correlations between these variables, it is important to bear in mind the fact that the variables chosen in this project explain just a part of the risk of developing cancer. A lot of other variables can be linked to the risk of developing cancer. For instance, diet as well as stress can have a huge impact on the body and therefore can be related to the risk of developing a medical condition. These aspects are not taken into account in our analysis.
Furthermore, the amount of data was limited and included outliers. Moreover, we assumed that data were collected in the same way worldwide. In other words, we did not take into account the quality of data collection, which can be problematic because we do not know exactly how the data were gathered or how consistent the reported statistics are. It is also crucial to consider the time period studied because some variables, such as the GDP per capita, are constantly evolving over time. Air pollution also changes according to the waste management, soil erosion, deforestation and industrial pollution in the country. These changes can affect the health of the population and therefore the incidence of cancer or lung cancer.
Future work
Additional research could be conducted to further investigate the relationship between cancer and its contributing factors. For instance, it would be interesting to study the relationship between diet choices and health issues. We now live in a society where people in rich countries are becoming more and more overweight and stressed. In the United States, the consumption of processed food and fast food is increasing, which leads to obesity, diabetes and other health conditions.
In terms of public policy, it would also be interesting to analyze smoking prevention in different countries to see whether differences in legislation have an impact on people’s behavior and smoking habits. Moreover, further analysis could be conducted with many other variables such as sun exposure, diet, physical inactivity, genetic inheritance and access to health care. In fact, many variables can be related to the risk of developing cancer.